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Most viruses in human skin are known to be human papillomaviruses (HPVs). Previous sequencing of skin 
samples has identified 273 different cutaneous HPV types, including 47 previously unknown types. In the 
present study, we wished to extend prior studies using deeper sequencing. This deeper sequencing without 
prior PCR of a pool of 142 whole genome amplified skin lesions identified 23 known HPV types, 3 novel 
putative HPV types and 4 non-HPV viruses. The complete sequence was obtained for one of the known 
putative types and almost the complete sequence was obtained for one of the novel putative types. In 
addition, sequencing of amplimers from HPV consensus PCR of 326 skin lesions detected 385 different 
HPV types, including 226 previously unknown putative types. In conclusion, metagenomic deep sequencing 
of human skin samples identified no less than 396 different HPV types in human skin, out of which 229 
putative HPV types were previously unknown. 

Analysis of metagenomes using deep sequencing has in recent years become a routine approach 1 . The 
bacterial community of the skin has been investigated through metagenomic sequencing targeting the 16S 
ribosomal DNA. The viral part of the microbiome has not been analyzed as thoroughly, because the 
viruses do not share a common consensus sequence that can be targeted by molecular methods 2 . 

However, metagenomic sequencing has revealed that >95% of the viral sequences present in skin samples 
belong to the Papillomavirus family, mostly to the |3- and y-genera 3 . There are now 197 different HPV types 
established at the International HPV Reference Center (www.hpvcenter.se, accessed on 2014-06-05), but putative 
novel HPV types are continuously being discovered 4 The cutaneous HPV types have been found in healthy skin 
as well as in different skin lesions such as squamous cell carcinoma (SCC), actinic keratosis (AK) and keratoa- 
canthoma (KA), in both immunocompetent and immunosuppressive patients 1018 . Some mucosal HPV types 
cause cervical cancer 19 , as well as vulvar, anal and penile cancers 20 , whereas some cutaneous HPV types cause skin 
warts and others are associated with SCC in patients with a rare immunosuppressive disease 21-22 . Virus-associated 
cancers have an increased incidence among immunosuppressed patients 23 . This fact has spurred intensive efforts 
to search for viruses in cancers that have an increased incidence among the immunosuppressed patients, but that 
are not known to have a viral etiology. The cancer form that is most highly increased in incidence among the 
immunosuppressed is non-melanoma skin cancer 23 . 

We previously performed metagenomic sequencing of 142 skin samples amplified only with whole genome 
amplification (that is, without prior PCR) using 454 and Ion Torrent sequencing technologies, where most of the 
viral sequences mapped to the HPV family and 7 and 12 HPV types were identified, respectively 13 . Classification 
of HPV is based on the sequence of the major capsid protein gene LI. The LI sequence of a new HPV type should 
be <90% similar to the LI gene in any known HPV type 24 . The HPV consensus PCR primer pair FAP59/64 
amplifies both mucosal and cutaneous types 25 . As HPVs have been reported to be the most common viruses in the 
human skin 313 , we have also described the use of deep sequencing of general primer HPV PCR amplimers as a 
method to detect additional HPV types 9,26 . In this study, we wished to perform analysis with deeper sequencing 
using the much more powerful Illumina sequencing technology to investigate if there might be additional viruses 
present in human skin. 

Results 

Metagenomic sequencing of whole-genome amplified skin lesions (without prior PCR). Swab samples from 
82 SCCs and 60 AKs skin lesions were subjected to whole genome amplification, pooled and sequenced using the 
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Illumina MiSeq sequencing platform. We found a total of almost 100 
000 HPV reads (>99% of all viral reads) comprising 21 known 
established HPV types, 2 known putative types and 3 novel 
putative HPV types (Table I). The taxonomic definition of an HPV 
type is that the LI sequence of a new HPV type should be less than 
90% similar to the LI gene in any known HPV type. Known putative 
HPV types refers to sequences that are present in GenBank, but have 
not yet been cloned and hence, no official HPV type number have 
been assigned to them by the International HPV Reference Center 
(www.hpvcenter.se). In addition, human polyomavirus 6, Merkel cell 
polyomavirus, torque teno virus and human endogenous retrovirus 
were detected with between 7 and 205 reads each (Table I). 

Compared to previous analysis of the same sample pool using 
454GSFLX and Ion Torrent PGM technologies 13 , we obtained a 
260-fold and 35-fold larger number of viral reads using the 
Illumina MiSeq, respectively. The MiSeq analysis detected 26 estab- 
lished or putative HPV types, compared to 7 and 14 detected by 



GSFLX and PGM, respectively. 1 1 of the 14 HPV types detected by 
any of the formerly used technologies, the GSFLX and the PGM, were 
detected by the Illumina MiSeq. Only 3 previously detected types 
(HPV45, HPV59, FA73) were not detected by MiSeq. However, these 
3 HPV types were all only detected with a single read each during 
previous sequencing runs with the 454 or Ion Torrent technologies 13 . 

For one of the known putative types, SE46, the complete sequence 
was obtained from a total of 4881 reads (maximum coverage of 273). 
From the previous GSFLX and PGM runs, SE46 was sequenced with 
22 reads (maximum coverage of 5) and 132 reads (maximum cov- 
erage of 18), respectively with only a partial sequence obtained 13 . In 
this study, the Illumina MiSeq sequencing increased the sequencing 
depth approximately 220 times for this particular virus and the com- 
plete genome of the type SE46 was obtained (Figure 1). The genomic 
organization was similar to that of established y-papillomaviruses. 

In addition, partial sequences of the 3 previously unknown putat- 
ive HPV types (herein named SE355, SE356, SE357), were found 



Table 1 [ Number of reads for the different virus types (established HPV types, known and novel putative HPV types and 
detected in pool D (142 multiple displacement amplified skin swab samples) by at least one of the platforms. *SCC 
carcinoma, AK = actinic keratosis. **The data from GSFLX and PGM platforms are previously published 13 
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Figure 1 | Coverage plot of the putative HPV type SE46 genome. Coverage is represented as percentages (maximum coverage/coverage at a 
particular position), comparing different sequencing methods with each other. Grey, green and red histograms correspond to genome coverage from the 
Roche 454GSFLX, Ion Torrent PGM and Illumina MiSeq platform, respectively. The data from GSFLX and PGM platforms are previously published 13 . 
Light blue lines represent putative open reading frames of SE46. The plot was generated using Circos visualization tool 40 . 



using the Illumina MiSeq platform (Table I). For SE355 we obtained 
a 7080 bp long contig, assembled from 369 reads, representing 
almost the complete HPV genome. The GenBank accession numbers 
for the 3 novel putative types (numbered SE355 to SE357) are 
KJ506870 to KJ506872. 

Known and novel HPV types found after HPV consensus PCR. 

Sequencing of the HPV-amplimers in the 3 pools of in total 326 skin 
lesions identified 159 known HPV types, out of which 52 were known 
established HPV types and 107 were known putative types (Table II). 
In addition, the deep sequencing identified 226 sequences of pre- 
viously unknown putative novel HPV types. 65% of the identified 
previously known HPV types (104/159) belonged to the y-genus, 
whereas 63% of the novel putative types (142/226) belonged to the 
P-genus (Table II). There were only small differences between the 3 
pools. They had similar distribution of different kinds of HPV types, 
both known and novel. For all 3 pools, more than half of all HPV 
sequences detected were novel putative types. The GenBank acces- 
sion numbers for the 226 novel putative types (herein numbered 
SE129 to SE354) are KJ506873 to KJ507098. 

The new HPV phylogenetic tree. 160 of the 226 novel putative HPV 
types identified by the MiSeq sequencing of the three pools of HPV 



amplimers contained >400 bp or >200 bp including the 3'-end of 
the HPV consensus PCR amplimers and are shown in the HPV tree 
together with all known established types and previously known 
putative types, labeled as SE-types (Figure 2). The novel putative 
types cluster together in the (3- and y-genera. One large group of 
17 novel putative types and 2 known putative types forms a new 
branch in the tree, belonging to the y-genus. Another group of 9 
novel putative types forms a new branch belonging to the P-genus 
(Figure 2). 

Comparison of results from different samples. The swabbing of the 
surface of human skin is intended to characterize the diversity of 
viruses present, but is less informative for making associations with 
skin diseases as viruses that are shed from other places on the body 
might be detectable in swabs of skin surfaces also on rather distant 
sites on the body. Skin surface swabs were used in the metagenomic 
analysis (pool D) as well is with deep sequencing of HPV general 
primer amplimers (pool C). The sequencing of amplimers was 
unquestionably more sensitive (352 different HPVs detected) 
compared to the metagenomic sequencing (26 different HPVs 
detected), but it is noteworthy that no less than 11 of the HPVs 
that were detected in the metagenomic sequencing were not 
detected when sequencing HPV general primer PCR amplimers 
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Table II | The number of different HPV types or putative types identified in the 3 pools of HPV consensus PCRamplimersfrom 326 skin lesion 
samples. Established (known, sequenced and numbered HPV type); Known putative (known HPV sequence, present in Genbank and 
different (<90% nt similarity in the LI gene) from other Established types or Known HPV sequences) and Novel putative (not present in 
Genbank and different (<90% nt similartity in the LI gene) from all Established HPV types or HPV sequneces present in Genbank). Pool A) 
frozen biopsies from 29 SCCs and 31 AKs, pool B) frozen biopsies from 91 KAs and pool C) swab samples from 84 SCCs and 91 AKs, 
where SCC = squamous cell carcinoma, AK = actinic keratosis and KA = keratoachantoma. *The number of known and novel types 
detected according to genera is based on the top hit sequence in BLAST 
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(Tables I and II) implying that these viruses were not effectively 
amplified by the general primers used. The metagenomic 
sequencing had an average coverage of the human of 2.5-fold, 
suggesting that viruses that are WGA-amplified about as effectively 
as the human genome had been detectable if present at about 1 copy 
per cell and that viruses that are WGA-amplified about as effectively 
as the HPV control plasmid had been detectable if present at about 
0.04 copies/cell. 

Two analyzed samples contained biopsies from either SCC and its 
precursor AK (pool A) or from KA (pool B). Most of the detected 
HPVs were present in about equal abundance in all the pools tested, 
suggesting that they were not specifically associated with any par- 
ticular skin disease. Viruses that were much more (> 10-fold number 
of reads) abundant in the pool with SCC and AK were HPV158, FA9, 
GC05, KC45, SE126, SE253, SE279, SE337 and SE341. Viruses that 
were much more abundant in the KA pool were HPV4, HPV77, 
HPV148, FA89, FA199,SE205, SE242, SE 243, SE265, SE270, 
SE273.SE274 and SE325. 

Discussion 

The aim of the present study was to investigate if deeper sequencing 
using the Illumina platform would extend the diversity of HPVs 
found to be present in human skin. Indeed, our study found that 
human skin contains an extreme diversity of cutaneous HPV types, 
establishing that the HPV types are far more numerous than prev- 
iously known. We have previously used first generation deep sequen- 
cing technologies to study the viral metagenome in skin, for example 
the 454GSFLX with original and titanium chemistries 9,14,15,27 as well 
as Ion Torrent PGM 13 . The current, much deeper, sequencing using 
the Illumina platform resulted in an approximately thousand-fold 
increase in sequence depth and revealed 229 previously unknown 
putative HPV types, resulting in a fundamental change in our under- 
standing of the diversity of HPVs. 

Impressively, deep sequencing appears to be a reliable method for 
detection of viruses in skin samples. In the pool with 142 skin sam- 
ples amplified only with whole genome amplification (but not with 
prior PCR), the deeper Illumina sequencing resulted in 200 times and 
25 times more total number of reads compared to previous metage- 
nomic sequencing using 454 and Ion Torrent technologies 13 and 
generated 260 times and 35 times more number of viral sequences 
as well as 4 times and 2 times more HPV types, respectively. 3 
previously unknown putative HPV types were only found using 
the Illumina technology and for one of them almost the complete 
sequence was obtained. These findings indicate that the Illumina 
deep sequencing technology is a useful method to obtain the com- 
plete picture of the viruses that are present in human skin. 9 out of 12 
viral types that were detected with only a single read using 454GSFLX 
or Ion Torrent PGM were confirmed by the Illumina, with many 



more reads. Only 3 of the 12 viruses previously detected by only a 
single viral read were not confirmed by the deeper sequencing. 

The whole genome amplification method used (GenomiPhiHigh- 
Yield) is intended for a random amplification of all DNA in a sample 
and is thus routinely used for amplifying linear DNA. However, the 
GenomiPhi reaction has a preference for circular templates, which 
will have made it easier to detect circular genomes in the pool amp- 
lified with WGA. The WGA amplified the human genome 25 times 
and a circular HPV plasmid an additional 25 times. Thus, although 
the WGA will have made it easier to detect the circular HPV gen- 
omes, it is unlikely that we would have missed linear viruses unless 
they were present only in very small amounts. 

Three of the pools were amplified with an HPV consensus PCR 
before sequencing. Although the amplimer length is 450 bp, some 
sequences obtained were longer or shorter than expected This could 
be due to the fact that these PCR primers are highly degenerate and 
may to some extent also bind unspecifically. 

The majority of the HPV types and putative types detected in this 
study belonged to the |3- and y-genera, but 1 1 mucosal types from oc- 
genus were also found, out of which one was a putatively novel HPV 
type. Both presence of anogenital oncogenic HPV types in skin sam- 
ples 28,29 and contamination of the skin by viruses originating from 
mucosal surfaces, mediated by the fingers, has been reported 30 . In this 
study, biopsies from SCCs and AKs were taken after tape-stripping of 
the skin surface 15 , to reduce the probability of detecting contaminat- 
ing viruses. 

Phylogenetically, the majority of the novel HPV sequences 
belonged to the P-genus (142/226) followed by the y-genus (83/ 
226). In previous studies, the majority of novel putative HPV types 
found belonged to the y-genus. Possibly, the viruses in the y-genus 
may exist at higher viral loads, making them easier to detect, wheras 
viruses of P-genus might be biologically different and be present in 
lower viral loads. It might intuitively be surmised that viruses present 
in high amounts are more pathogenic, but there is no data to support 
this notion. Alternative scenarios are possible where the many dif- 
ferent viruses present may interact. A first step that will enable 
investigations of this issue is the basic knowledge that these viruses 
are present - the basic and fundamental discovery of the paper. 
Although 22 of the HPVs detected appeared to be more common 
in either the SCC/AK or KA sample, the massive number of viruses 
detected suggests that the association may have been due to chance 
and that it needs to be investigated in further studies. 

The P-genus is already diverse with 45 completely sequenced HPV 
types established (www.hpvcenter.se). The y-genus has been growing 
rapidly and has surpassed the P-genus, with now 61 completely 
sequenced y types established (www.hpvcenter.se). It appears that 
some of the new HPV types detected in the present study form 
clusters outside the previously defined species. Two of them are 
particularly noticeable, a new branch in the P-genus formed from 
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Figure 2 | Bayesian phylogenetic tree based on the LI part of the complete 164 established HPV types (+ bovine papillomavirus type 3 and type 5) and 
160 putative novel HPV types (SE-types) that were >400 bp or contained >200 bp of the 3'-end of the amplimer. SE-types discovered using 
454GSFLX 9 ' 26 and using Illumina MiSeq are presented in blue and red colors, respectively. 



9 novel putative HPV types and a new branch in the y-genus formed 
from 17 novel putative HPV types. The second branch also includes 2 
previously detected putative HPV types 26 . However, whether these 
new branches do indeed represent new HPV species needs confirma- 
tion by cloning, sequencing and deposition at the international HPV 
reference center (www.hpvcenter.se). 

With increasing sensitivity and throughput of the sequencing 
technology, the possibility always exists that assembly algorithms 
may construct erroneous "chimeric" sequences by the assembly of 
two different sequences from different viruses. Genomic recombina- 
tion has been described for cetacean papillomaviruses 31 . Even though 
this has not been described for HPVs, multiple co-infections of 
related microorganisms can result in recombination 32 . Both naturally 



occurring genomic recombination and PCR-mediated recombina- 
tion may mislead phylogenetic analysis 32 . We have previously 
developed a bioinformatics pipeline, which detects and removes 
putative chimeras that may have resulted from PCR-mediated 
recombination between related templates 26,33 . The bioinformatics 
in this study was even stricter regarding chimera checking also for 
known viruses. Hence, some sequences reported in the previous 
studies 13-26 did not pass the chimera checking step in the data analysis 
pipeline in this study. The phylogenetic tree was constructed only 
with the sequences that had passed our most strict chimera checking 
algorithm. 

In conclusion, we demonstrate that the diversity of HPVs in 
human skin is far greater than previously known. Deep sequencing 
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using the Illumina platform is effective to detect a plethora of HPVs 
that are present in skin samples. As deep sequencing technologies are 
continuously evolving with increasing throughput and decreasing 
cost per base pair, it is likely that an extraordinary and expanding 
diversity of cutaneous HPVs will continue to be revealed. As the 
HPVs are much more diverse than previously known, this needs to 
be taken into consideration in the design of detection methods and 
when designing studies of HPV biology or epidemiology. 

Methods 

Samples. Swabs and biopsies of non- melanoma skin cancer lesions (SCC, AK and 
KA) from immunocompetent patients attending Swedish and Austrian hospitals 27 
and biopsies of KA lesions from both immunosuppressed and immunocompetent 
patients attending the Norwegian National Hospital in Norway 14 were used. The swab 
samples had been collected both from the top of the lesion and from normal adjacent 
skin by a pre- wetted (0.9% NaCl) cotton-tipped swab that was rolled on the lesion 
(within margins of the lesion) or on the normal skin respectively, and suspended in 
1 mL of saline. In addition, 2 mm diameter punch-biopsies had been taken after tape- 
stripping from each lesion as well as from normal adjacent skin 15 . Biopsies from SCC 
and AK had been HPV-positive in a previous study 27 . The DNA was extracted with a 
phenol-free method 25 for the Swedish/Austrian biopsies and with the QIAamp DNA 
Minikit (Qiagen, Germany) 14 for the Norwegian biopsies. Swab samples were not 
extracted, only frozen and thawed. Informed consent was obtained from participants. 
The study adhered to the declaration of Helsinki and was approved by the Ethical 
Review Committees of Karolinska Institutet and of Lund University (Sweden), 
Medical University Vienna (Austria) and Institutional Review Board in Oslo 
(Norway). All methods were carried out in accordance with the approved guidelines. 

DNA amplification and pooling. The samples were mixed into 4 pools. Pool A, B 
and C were used in a previous study 26 , where pool A constituted fresh frozen biopsies 
from 29 SCC lesions and 31 AK lesions, pool B fresh frozen biopsies from 91 KA 
lesions and pool C top of lesion swabs of 84 SCCs and 91 AKs. Prior pooling into A, B 
and C, the samples were PCR amplified with the HPV consensus primer pair FAP59/ 
64, targeting the LI gene of a-, [3- and y-HPVs, as described previously 25 , using 5 uL 
of each sample per reaction. For SCC and AK biopsies, only samples that were HPV 
positive by PCR 27 were used (SCC, n — 29 and AK, n — 31). A plasmid including the 
sequence of HPV type 12 was used as a positive control and a detection limit of one 
copy per microliter was found. The PCR amplimers from the three pools were 
purified using the MinElute spin column kit from Qiagen according to 
manufacturer's guidelines. Two columns per sample pool were used with a sample 
volume of 120 uL per column. The elution volume was 10 uL of EB-buffer. The two 
elutions from each sample were mixed, generating a purified PCR product of 20 uL. 
Pool D, also used in a previous study 13 , was a mix of swab samples from 82 SCC 
lesions and 60 AK lesions. The 142 swab samples were subjected to random whole 
genome amplification using the GenomiPhi High Yield kit (GE Healthcare, UK) and 
combined into pool D, where upon the DNA was ethanol precipitated and dissolved 
in 100 uL water. The 4 pools were quantified using QuantiFluor-ST (Promega, US), a 
fluorometric assay quantifying dsDNA, according to manufacturer's user guide. The 
concentrations were 22.4 ng/uL, 17.8 ng/uL, 19.6 ng/uL and 56.4 ng/uL for pool A, 
B, C and D respectively. 

Sample library preparation. DNA libraries for the 3 HPV consensus PCR amplified 
pools, A, B and C, were prepared using the TruSeq Nano DNA Sample Preparation kit 
according to the user guide revision A (Illumina) with the following modifications: as 
the samples consisted of approximately 450 bp long PCR products, the 
fragmentation, end-repair and size selection steps were omitted and hence, the library 
preparation started with adenylation of 3 '-ends. As Illumina's recommended input 
DNA is 100 ng and 200 ng for 350 bp and 550 bp insert size respectively, we used 
150 ng of the 450 bp long PCR product as input in a volume of 17.5 uL, diluted in 
resuspension buffer, for pool B (first pool to be sequenced). Due to low cluster density 
in the sequencing flow cell for pool B, we adjusted the input DNA for library 
preparation to 50 ng for pool A and C. Having too much DNA during adapter 
ligation will result in more fragments containing only one adapter, which will be part 
of library quantification, but not be able to generate clusters in the sequencing flow 
cell, leading to low cluster density. The index adapters AD007, AD002 and AD019 
were used for pool A, B and C respectively. 

DNA library of the random amplified pool D was prepared using the Nextera DNA 
Sample Preparation kit according to the user guide revision B (Illumina), starting with 
50 ng DNA in the tagmentation reaction. A MinElute spin column was used in the 
clean-up step according to the MinElute user guide (Qiagen, US) instead of the Zymo 
spin plate. 

The library pools were quantified with the QuantiFluor system as above and the 
library sizes were checked using the Bioanalyzer (Agilent). Pool A contained frag- 
ments between 200 to 1370 bp. The main peaks for pool B and C was at 566 bp and 
560 bp respectively, corresponding to the FAP-PCR amplimer size. Pool D contained 
fragments between 300 to 7000 bp with the main peak at 1200 bp. For pool A, B and C 
the mean DNA fragment sizes were estimated to be 500 bp and the conversion 
formula 1 ng/uL — 3 nM DNA was used. For pool C the mean fragment size was 



estimated to be 1-1.5 kb and the conversion formula 1 ng/uL — 1.5 nM DNA was 
used. 

Sequencing on the Illumina MiSeq. Denatured libraries at 20 pM from the FAP- 
PCR amplified pool A, B and C were spiked with 5% PhiX control and individually 
sequenced by paired-end 301 + 301 cycles on the MiSeq instrument using version 3 
reagent kit (Illumina, US). For the multiple displacement amplified pool D, 10 pM 
denatured library was spiked with 1% PhiX control and sequenced by paired-end 251 
+ 251 cycles on the MiSeq using version 2 reagent kit. The sequencing preparations 
were made according to the user guides Preparing Libraries for Sequencing on the 
MiSeq revision C, Reagent Preparation Guide revision A and MiSeq System User 
Guide revision K. 

The sequencing flow cell cluster density for pool B, first to be sequenced, was 
422 K/mm 2 , the total yield was 6.2 Gb, 68% >Q30 and 98% of reads were passing 
filter. The cluster densities for pool A and C were 1113 K/mm 2 and 1339 K/mm 2 , the 
total yields were 15.6 Gb and 18.6 Gb, 66% and 67% >Q30, 95% and 94% of reads 
passing filter - an improvement compared to pool B presumably due to the adjusted 
DNA input for library preparation. Pool D had a cluster density of 627 K/mm 2 , the 
total yield was 5.9 Gb, 86% were >Q30 and 97% of reads were passing filter. 

Analysis of sequences. Sequences were obtained from the MiSeq (Illumina) 
instrument. Indices, included in the Illumina adaptors, were used to assign the 
sequences obtained to the originating sample. The bioinformatic analysis started with 
quality checking, where sequences were trimmed according to their Phred quality 
scores 34 . Quality checked reads were then screened against the human reference 
genome hgl9 using BWA-MEM 35 and SOAP aligner (http://soap.genomics.org.cn) 
and reads with >95% identity over 75% of their length to human DNA were removed 
from further analysis. The rest of the sequences were normalized (http://ged.msu.edu/ 
papers/20 12-diginorm) to discard redundant data and reduce sampling variation and 
sequencing errors. The normalized dataset was then processed for assembly using the 
Trinity 36 , SOAPdenovo and SOAPdenovo-Trans (http://soap.genomics.org.cn/) 
assemblers into contiguous sequences (contigs). Reads before assembly were re- 
mapped to assembled contigs and the result was used to calculate number of reads for 
each assembled contig. The use of several assembly algorithms and re-mapping of all 
singleton reads to assembled contigs were used to validate assembly results 13 ' 37 . 
Assembled contigs were then subjected to taxonomic classification by comparing 
them against GenBank nucleotide database using paracel blast (www. 
strikingdevelopment.com) blastn to classify them as i) previously known sequences, 
ii) related to previously known sequences, or iii) unrelated to any previously known 
sequences. To identify possible artifactual "chimeric" sequences, contigs containing 
sequences originating from different viruses, all papillomavirus -related contigs and 
singletons were checked as described 38 . Shortly, the sequence that aligned to its most 
closely related sequence in GenBank was divided into three equal segments. If at least 
one of the segments differed in similarity to the corresponding overlapping parts with 
more than 5% (for example, if segment 1 was 88% similar and segment 2 was 94% 
similar) the sequence was considered as a "possible chimera". All analyses were 
performed using in-house R (www.R-project.org/) and python (www.python.org) 
scripts that run on a high performance (40 core, 2 TB RAM) Linux server. 

Phylogenetic analysis. BEAST vl .8.0 39 was used to construct a Bayesian Phylogenetic 
tree. 4 independent Markov chain Monte Carlo (MCMC) tests were run under a 
coalescent tree prior and relaxed uncorrelated lognormal molecular clock for 100 000 
000 generations, with a tree logged every 1000th generation. A maximum clade 
credibility tree was constructed by a tree annotator 39 after discarding the initial 25% 
generations as a conservative generalization of the "burn-in" phase. The analyses 
were restricted to sequences that were >400 bp or contained >200 bp of the 3 '-end 
of the amplimer. This restriction was necessary in order to avoid the possibility that 
non- overlapping sequences might derive from the same virus. 

Amplification of viral and human DNA by whole genome amplification (WGA). 

Because the samplesin pool D were amplified by WGA before sequencing, we 
quantified the amount of amplification of human DNA and of HPV DNA. To 
samples with human DNA (human placental DNA at 1 ng/ul), we added 20 copies/ul 
of HPV16 plasmid and amplified the samples with WGA using the GenomiPhi 
HighYield Ready-to-go kit (GE Healthcare) in the same manner as for the clinical 
samples. The amounts of human DNA and viral DNA were quantified using real-time 
PCR for Betaglobin and for HPV16, respectively. The human DNA was found to be 
amplified 26-fold, whereas the HPV16 DNA was amplified 679-fold. 
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